Web Scraping with R & rvest

Dr. Matthew Hendrickson

July 9, 2020

About Me

  • Dr. Matthew Hendrickson
  • Social Scientist by Training
    • Psychology & Music %>%
    • More Psychology %>%
    • Law & Policy
  • Professional Experience (13+ years)
    • Higher Education Analyst
    • Independent Consultant
    • Research projects, data analysis, policy development, strategy, analytics pipeline solutions

Topics

  1. A Little About Web Scraping
  2. Robots!
  3. HTML & CSS
  4. The Setup
  5. Scraping the Data
  6. Assembling the Data
  7. References & Resources

A Little About Web Scraping

“Web scraping is the process of automatically mining data or collecting information from the World Wide Web.”

– Wikipedia


Web scraping is a flexible method to extract data from the internet. It can involve extracting numerical or text data.

Use Cases

There are many uses for web scraping, including but not limited to:

  1. Price monitoring
  2. Sentiment analysis
  3. Time series tracking and analysis
  4. Brand monitoring
  5. Market analysis
  6. Lead generation

Robots!

  • No, not those robots!
  • Always ensure - PRIOR to scraping - that you have scraping rights!
  • This is critical - you can be blocked or even face legal action!

Robots.txt

Good news! You can easily check with the robotstxt package.

paths_allowed(paths = c("https://netflix.com/"))
#> [1] FALSE

Netflix does not allow you to scrape their site.
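
To see how the rules drive that TRUE/FALSE answer without making a request, here is a minimal sketch. It assumes the robotstxt() constructor accepts a text argument and exposes a check() method; the rules below are hypothetical and are not Netflix's actual file.

```r
library(robotstxt)

# Hypothetical robots.txt content, supplied as text so nothing is downloaded
rules <- "User-agent: *\nDisallow: /private/"

# Build a robotstxt object from the text (assumed constructor argument)
rt <- robotstxt(text = rules)

# Ask whether any bot ("*") may fetch each path
allowed <- rt$check(paths = c("/", "/private/page.html"), bot = "*")
allowed
```

The point is that permission is per path: a site can allow its root while disallowing the section you actually want.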

HTML & CSS

“HTML is the standard markup language for creating Web pages.”

– W3Schools


“CSS describes how HTML elements are to be displayed on screen, paper, or in other media.”

– W3Schools

HTML Structure

Image credit: Professor Shawn Santo

HTML Tags

HTML is structured with “tags,” which mark up portions of the page and can be selected by name when scraping.

There are many types of tags - here are some important ones for scraping:

  • <h1> - heading tags (<h1> through <h6>)
  • <p> - paragraph elements
  • <ul> - unordered bulleted list
  • <ol> - ordered list
  • <li> - individual list item
  • <div> - division
  • <table> - table
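
As a small illustration of selecting by tag, the sketch below parses a hand-written page from a string, so no site is contacted. The page content is made up for the example.

```r
library(rvest)

# A tiny hypothetical page, parsed directly from a string
page <- read_html("<html><body>
  <h1>R Books</h1>
  <p>A short reading list.</p>
  <ul>
    <li>R for Data Science</li>
    <li>The Book of R</li>
  </ul>
</body></html>")

# Select nodes by tag name, then pull out their text
heading <- page %>% html_nodes("h1") %>% html_text()
items   <- page %>% html_nodes("li") %>% html_text()

heading
items
```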

A Little Help with CSS

If you aren’t familiar with CSS, extracting parts of a website can be daunting.

SelectorGadget is incredibly helpful. It is available as a Chrome extension and as a bookmarklet.

Inspecting the page elements is also helpful; the inspector is available as a developer tool in most major browsers.

Scraping Methods

CSS - syntax is easier and aligns with HTML tags

XPATH - useful when the node isn’t uniquely identified with CSS
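
The two methods can be compared side by side. This is a sketch on a made-up page; the class names are hypothetical. html_nodes() takes either a css or an xpath argument.

```r
library(rvest)

# Hypothetical page with two book entries and one unrelated entry
page <- read_html("<html><body>
  <ul>
    <li class='book'>R for Data Science</li>
    <li class='book'>The Book of R</li>
    <li class='ad'>Sponsored</li>
  </ul>
</body></html>")

# CSS: select list items by class
css_books <- page %>% html_nodes(css = "li.book") %>% html_text()

# XPath: the same nodes, written as a path with an attribute test
xpath_books <- page %>% html_nodes(xpath = "//li[@class='book']") %>% html_text()

# XPath can also match on text content, which CSS selectors cannot
by_text <- page %>%
  html_nodes(xpath = "//li[contains(text(), 'Science')]") %>%
  html_text()
```

Both selectors return the same two titles here; the text-based XPath returns only the first.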

The Setup

Set up the environment to scrape the site.

library(tidyverse)
library(robotstxt)
library(rvest)

That’s it!

Determine a website to scrape

It only seems appropriate to pull data from R books on Amazon.

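Ensure we can scrape the site. paths_allowed() evaluates the specific path it is given, so check the path you plan to request, not just the site root.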

paths_allowed(paths = c("https://amazon.com/"))
#> [1] TRUE


We are good to scrape!

Setting the URL

Before you get started, you must specify the URL.

Data as of 2020-07-06.

amazon <- read_html("https://www.amazon.com/s?k=R&i=stripbooks&rh=n%3A283155%2Cn%3A75%2Cn%3A13983&dc&qid=1592086532&rnid=1000&ref=sr_nr_n_1")

Titles

Scraping Book Titles

amazon %>% 
  html_nodes(".s-line-clamp-2") %>% 
  html_text() -> titles
head(titles)
#> [1] "\n    \n    \n        \n\n\n\n\n\n    \n        \n            \n                GANs in Action: Deep learning with Generative Adversarial Networks\n            \n        \n        \n    \n\n\n    \n"                                             
#> [2] "\n    \n    \n        \n\n\n\n\n\n    \n        \n            \n                R for Data Science: Import, Tidy, Transform, Visualize, and Model Data\n            \n        \n        \n    \n\n\n    \n"                                         
#> [3] "\n    \n    \n        \n\n\n\n\n\n    \n        \n            \n                The Book of R: A First Course in Programming and Statistics\n            \n        \n        \n    \n\n\n    \n"                                                    
#> [4] "\n    \n    \n        \n\n\n\n\n\n    \n        \n            \n                R Graphics Cookbook: Practical Recipes for Visualizing Data\n            \n        \n        \n    \n\n\n    \n"                                                    
#> [5] "\n    \n    \n        \n\n\n\n\n\n    \n        \n            \n                An Introduction to Statistical Learning: with Applications in R (Springer Texts in Statistics)\n            \n        \n        \n    \n\n\n    \n"                 
#> [6] "\n    \n    \n        \n\n\n\n\n\n    \n        \n            \n                Statistical Inference via Data Science: A ModernDive into R and the Tidyverse (Chapman & Hall/CRC The R Series)\n            \n        \n        \n    \n\n\n    \n"

The scraped text includes many line breaks (\n) and blank spaces around each title.

Let’s clean this up with str_trim.

Removing white space and breaks (\n) from the Titles

titles <- str_trim(titles) # Removes leading & trailing space
head(titles)
#> [1] "GANs in Action: Deep learning with Generative Adversarial Networks"                                             
#> [2] "R for Data Science: Import, Tidy, Transform, Visualize, and Model Data"                                         
#> [3] "The Book of R: A First Course in Programming and Statistics"                                                    
#> [4] "R Graphics Cookbook: Practical Recipes for Visualizing Data"                                                    
#> [5] "An Introduction to Statistical Learning: with Applications in R (Springer Texts in Statistics)"                 
#> [6] "Statistical Inference via Data Science: A ModernDive into R and the Tidyverse (Chapman & Hall/CRC The R Series)"

This simple function returns cleaned text.
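
An alternative worth knowing is that html_text() has a trim argument that does the same job at extraction time. The sketch below uses a made-up page and a hypothetical .title class to show the two routes agree.

```r
library(rvest)
library(stringr)

# Hypothetical node with padding and line breaks around the text
page <- read_html("<html><body><span class='title'>
        The Book of R
    </span></body></html>")

# Route 1: extract, then trim with stringr
raw_title     <- page %>% html_nodes(".title") %>% html_text()
trimmed_after <- str_trim(raw_title)

# Route 2: trim while extracting
trimmed_inline <- page %>% html_nodes(".title") %>% html_text(trim = TRUE)

identical(trimmed_after, trimmed_inline)
```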

Formats

Scraping the Book Format

amazon %>% 
  html_nodes("a.a-size-base.a-link-normal.a-text-bold") %>% 
  html_text() -> format
head(format)
#> [1] "\n    \n        \n        \n            Paperback\n        \n    \n"
#> [2] "\n    \n        \n        \n            Paperback\n        \n    \n"
#> [3] "\n    \n        \n        \n            Kindle\n        \n    \n"   
#> [4] "\n    \n        \n        \n            Paperback\n        \n    \n"
#> [5] "\n    \n        \n        \n            eTextbook\n        \n    \n"
#> [6] "\n    \n        \n        \n            Paperback\n        \n    \n"

Clean up book format values

format <- str_trim(format)
head(format)
#> [1] "Paperback" "Paperback" "Kindle"    "Paperback" "eTextbook" "Paperback"

Price

Scraping the Book Price

The price structure splits price into two elements. We must pull each and combine them into a single price.

amazon %>% 
  html_nodes(".a-price-whole") %>% 
  html_text() -> price_whole
head(price_whole)
#> [1] "40." "40." "24." "33." "29." "23."

Scraping (the rest of) the Book Price

amazon %>% 
  html_nodes(".a-price-fraction") %>% 
  html_text() -> price_fraction
head(price_fraction)
#> [1] "86" "10" "99" "04" "99" "93"

Combine Price Portions

price <- paste(price_whole, price_fraction, sep = "")
head(price)
#> [1] "40.86" "40.10" "24.99" "33.04" "29.99" "23.93"

Make it numeric

price <- as.numeric(price)
head(price)
#> [1] 40.86 40.10 24.99 33.04 29.99 23.93
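
One thing to watch for with this approach: a price of 1,000 or more would carry a thousands separator, and as.numeric() returns NA for such strings. A minimal sketch with made-up values, also guarding against paste() silently recycling vectors of unequal length:

```r
# Hypothetical scraped pieces; the second price includes a comma
price_whole    <- c("40.", "1,040.", "24.")
price_fraction <- c("86", "10", "99")

# paste0() recycles silently, so confirm the pieces line up first
stopifnot(length(price_whole) == length(price_fraction))

# Combine, strip thousands separators, then convert
price_chr <- paste0(price_whole, price_fraction)
price_chr <- gsub(",", "", price_chr, fixed = TRUE)
price     <- as.numeric(price_chr)
price
```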

Rating

Scraping the Book Rating

amazon %>% 
  html_nodes("i.a-icon.a-icon-star-small.aok-align-bottom") %>% 
  html_text() -> rating
head(rating)
#> [1] "4.1 out of 5 stars" "4.7 out of 5 stars" "4.3 out of 5 stars"
#> [4] "4.7 out of 5 stars" "4.7 out of 5 stars" "5.0 out of 5 stars"

Let’s trim this into a usable metric

rating <- substr(rating, 1, 3) # Takes 3 characters starting at 1
head(rating)
#> [1] "4.1" "4.7" "4.3" "4.7" "4.7" "5.0"

Make it numeric

rating <- as.numeric(rating)
head(rating)
#> [1] 4.1 4.7 4.3 4.7 4.7 5.0
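
Taking the first three characters works when every rating looks like "4.7". As a hedged alternative, a regular expression pulls the leading number whatever its width. The second value below is a hypothetical format, included only to show the difference.

```r
library(stringr)

# Hypothetical rating strings; the second has no decimal part
rating_chr <- c("4.1 out of 5 stars", "5 out of 5 stars")

# Extract the leading number: digits, optionally followed by a decimal part
rating <- rating_chr %>%
  str_extract("^[0-9]+(\\.[0-9]+)?") %>%
  as.numeric()
rating
```

With substr(rating_chr, 1, 3) the second value would be "5 o", which converts to NA.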

Rating Counts

Scraping the Book Rating Count

This element is messier and we’ll need a number of cleaning steps.

amazon %>% 
  html_nodes("div.a-row.a-size-small") %>% 
  html_text() -> rate_n
head(rate_n)
#> [1] "\n\n\n\n    \n\n\n\n\n\n\n    \n        \n            \n            4.1 out of 5 stars\n        \n    \n    \n\n\n\n\n\n\n\n    \n\n\n\n\n\n    \n        \n            \n                9\n            \n        \n        \n    \n\n\n\n"  
#> [2] "\n\n\n\n    \n\n\n\n\n\n\n    \n        \n            \n            4.7 out of 5 stars\n        \n    \n    \n\n\n\n\n\n\n\n    \n\n\n\n\n\n    \n        \n            \n                427\n            \n        \n        \n    \n\n\n\n"
#> [3] "\n\n\n\n    \n\n\n\n\n\n\n    \n        \n            \n            4.3 out of 5 stars\n        \n    \n    \n\n\n\n\n\n\n\n    \n\n\n\n\n\n    \n        \n            \n                76\n            \n        \n        \n    \n\n\n\n" 
#> [4] "\n\n\n\n    \n\n\n\n\n\n\n    \n        \n            \n            4.7 out of 5 stars\n        \n    \n    \n\n\n\n\n\n\n\n    \n\n\n\n\n\n    \n        \n            \n                14\n            \n        \n        \n    \n\n\n\n" 
#> [5] "\n\n\n\n    \n\n\n\n\n\n\n    \n        \n            \n            4.7 out of 5 stars\n        \n    \n    \n\n\n\n\n\n\n\n    \n\n\n\n\n\n    \n        \n            \n                551\n            \n        \n        \n    \n\n\n\n"
#> [6] "\n\n\n\n    \n\n\n\n\n\n\n    \n        \n            \n            5.0 out of 5 stars\n        \n    \n    \n\n\n\n\n\n\n\n    \n\n\n\n\n\n    \n        \n            \n                3\n            \n        \n        \n    \n\n\n\n"

Trim the Rating Count

rate_n <- str_trim(rate_n) # trim \n & ' '
head(rate_n)
#> [1] "4.1 out of 5 stars\n        \n    \n    \n\n\n\n\n\n\n\n    \n\n\n\n\n\n    \n        \n            \n                9"  
#> [2] "4.7 out of 5 stars\n        \n    \n    \n\n\n\n\n\n\n\n    \n\n\n\n\n\n    \n        \n            \n                427"
#> [3] "4.3 out of 5 stars\n        \n    \n    \n\n\n\n\n\n\n\n    \n\n\n\n\n\n    \n        \n            \n                76" 
#> [4] "4.7 out of 5 stars\n        \n    \n    \n\n\n\n\n\n\n\n    \n\n\n\n\n\n    \n        \n            \n                14" 
#> [5] "4.7 out of 5 stars\n        \n    \n    \n\n\n\n\n\n\n\n    \n\n\n\n\n\n    \n        \n            \n                551"
#> [6] "5.0 out of 5 stars\n        \n    \n    \n\n\n\n\n\n\n\n    \n\n\n\n\n\n    \n        \n            \n                3"

Rating Count - Substring

rate_n <- str_sub(rate_n, -5) # keep last 5 characters
head(rate_n)
#> [1] "    9" "  427" "   76" "   14" "  551" "    3"

Trim the Rating Count (Again)

rate_n <- str_trim(rate_n) # trim leading spaces
head(rate_n)
#> [1] "9"   "427" "76"  "14"  "551" "3"

Set as Numeric

rate_n <- as.numeric(rate_n)
head(rate_n)
#> [1]   9 427  76  14 551   3
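
Keeping the last five characters assumes no count is longer than five characters. A sketch of a less position-dependent variant, using made-up strings shaped like the output above: trim, take the run of digits and commas at the end, drop the commas, and convert.

```r
library(stringr)

# Hypothetical raw values; the second count is six characters long
rate_raw <- c("\n  4.1 out of 5 stars\n    \n        9\n  ",
              "\n  4.7 out of 5 stars\n    \n        12,345\n  ")

rate_n <- rate_raw %>%
  str_trim() %>%                 # drop leading and trailing whitespace
  str_extract("[0-9,]+$") %>%    # trailing run of digits and commas
  str_remove_all(",") %>%        # remove thousands separators
  as.numeric()
rate_n
```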

Publication Date

Scraping the Book Publication Date

amazon %>% 
  html_nodes("span.a-size-base.a-color-secondary.a-text-normal") %>% 
  html_text() -> pub_dt
head(pub_dt)
#> [1] "Oct 8, 2019"  "Jan 10, 2017" "Jul 16, 2016" "Nov 30, 2018" "Sep 1, 2017" 
#> [6] "Dec 13, 2019"

Convert to a date

pub_dt <- as.Date(pub_dt, "%b %d, %Y")
head(pub_dt)
#> [1] "2019-10-08" "2017-01-10" "2016-07-16" "2018-11-30" "2017-09-01"
#> [6] "2019-12-13"

We Have the Pieces

Let’s assemble the file!

  1. Titles
  2. Formats
  3. Prices
  4. Ratings
  5. Rating Counts
  6. Publication Date

Let’s Check the Scrapes

length(titles)
#> [1] 18
length(format)
#> [1] 37
length(price)
#> [1] 37
length(rating)
#> [1] 16
length(rate_n)
#> [1] 16
length(pub_dt)
#> [1] 18

Wait! What?!?

A common issue with scraping is that the vectors come back with different lengths, because some elements are missing for certain records and others appear more than once.

We can fix this!

  • …manually…
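
A way to avoid some of the manual work is to scrape one container per record and then look inside each container with html_node() (singular), which returns one result per container and NA where the element is missing. This is a sketch on a made-up page; the class names are hypothetical and are not Amazon's actual markup.

```r
library(rvest)

# Hypothetical results page: the second book has no rating
page <- read_html("<html><body>
  <div class='result'><span class='title'>Book A</span><span class='rating'>4.7 out of 5 stars</span></div>
  <div class='result'><span class='title'>Book B</span></div>
  <div class='result'><span class='title'>Book C</span><span class='rating'>4.3 out of 5 stars</span></div>
</body></html>")

# One node per record
results <- page %>% html_nodes("div.result")

# html_node() keeps one entry per record, so missing values become NA
titles <- results %>% html_node(".title")  %>% html_text()
rating <- results %>% html_node(".rating") %>% html_text()

length(titles) == length(rating)
```

Because both vectors have one entry per container, they stay aligned without inserting NA values by position.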

Fixing the Scrapes

Titles

All titles were populated and scraped accurately. However, due to multiple formats, these records must be repeated to fill the dataframe.

titles %>% 
  rep(each = 2) -> titles
length(titles)
#> [1] 36

Titles

Some titles have only 1 format.

titles <- titles[-c(10, 20, 30, 34, 40)] # drop the extra copies; index 40 exceeds the length and is ignored
length(titles)
#> [1] 32
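
The count drops by four rather than five because R ignores negative indices that fall beyond the end of a vector. A minimal base R illustration with made-up values:

```r
# Six hypothetical entries
x <- paste("book", 1:6)

# Index 2 is removed; index 40 does not exist and is silently ignored
trimmed <- x[-c(2, 40)]
length(trimmed)
#> [1] 5
```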

Titles

Some titles have more than 2 formats.

titles %>% 
  append(values = titles[33], after = 33) %>% 
  append(values = titles[25], after = 25) %>% 
  append(values = titles[12], after = 12) %>% 
  append(values = titles[10], after = 10) %>% 
  append(values = titles[5], after = 5) -> titles
length(titles)
#> [1] 37
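
The repeat, drop, and append steps can be collapsed into one call if the number of formats per book is known. This is a sketch under that assumption; n_formats is a hypothetical vector you would have to build yourself, for example by counting format links per result.

```r
# Hypothetical titles and the number of formats listed for each
titles    <- c("Book A", "Book B", "Book C")
n_formats <- c(2, 1, 3)

# Repeat each title once per format
titles_long <- rep(titles, times = n_formats)
titles_long
```

The same call with the same n_formats would expand ratings, rating counts, and publication dates consistently.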

Formats

Nothing needed here!

length(format)
#> [1] 37

Prices

Or here!

length(price)
#> [1] 37

Ratings

Some books don’t have ratings.

A book only has one rating even if it has multiple formats.

We must also account for multiple formats.

rating %>% 
  append(values = NA, after = 14) %>% 
  append(values = NA, after = 12) %>% 
  append(values = NA, after = 12) -> rating
length(rating)
#> [1] 19

Ratings

Like titles, the ratings need to be repeated to show on the correct row.

The same corrections are done here.

rating %>% 
  rep(each = 2) -> rating
length(rating)
#> [1] 38

Ratings

Some books have only 1 format.

rating <- rating[-c(10, 20, 30, 34, 40)] # drop the extra copies; index 40 exceeds the length and is ignored
length(rating)
#> [1] 34

Ratings

Some books have more than 2 formats.

rating %>% 
  append(values = rating[33], after = 33) %>% 
  append(values = rating[25], after = 25) %>% 
  append(values = rating[12], after = 12) %>% 
  append(values = rating[10], after = 10) %>% 
  append(values = rating[5], after = 5) -> rating
length(rating)
#> [1] 39

Rating Counts

Not all titles have a rating.

rate_n %>% 
  append(values = NA, after = 14) %>% 
  append(values = NA, after = 12) %>% 
  append(values = NA, after = 12) -> rate_n
length(rate_n)
#> [1] 19

Rating Counts

Like titles, the rating counts need to be repeated to show on the correct row.

The same corrections are done here.

rate_n %>% 
  rep(each = 2) -> rate_n
length(rate_n)
#> [1] 38

Rating Counts

Some books have only 1 format.

rate_n <- rate_n[-c(10, 20, 30, 34, 40)] # drop the extra copies; index 40 exceeds the length and is ignored
length(rate_n)
#> [1] 34

Rating Counts

Some books have more than 2 formats.

rate_n %>% 
  append(values = rate_n[33], after = 33) %>% 
  append(values = rate_n[25], after = 25) %>% 
  append(values = rate_n[12], after = 12) %>% 
  append(values = rate_n[10], after = 10) %>% 
  append(values = rate_n[5], after = 5) -> rate_n
length(rate_n)
#> [1] 39

Publication Date

Create extra rows due to multiple book formats.

pub_dt %>% 
  rep(each = 2) -> pub_dt
length(pub_dt)
#> [1] 36

Publication Date

Some books have only 1 format.

pub_dt <- pub_dt[-c(10, 20, 30, 34, 40)] # drop the extra copies; index 40 exceeds the length and is ignored
length(pub_dt)
#> [1] 32

Publication Date

Some books have more than 2 formats.

pub_dt %>% 
  append(values = pub_dt[33], after = 33) %>% 
  append(values = pub_dt[25], after = 25) %>% 
  append(values = pub_dt[12], after = 12) %>% 
  append(values = pub_dt[10], after = 10) %>% 
  append(values = pub_dt[5], after = 5) -> pub_dt
length(pub_dt)
#> [1] 37

One More Check!

length(titles)
#> [1] 37
length(format)
#> [1] 37
length(price)
#> [1] 37
length(rating)
#> [1] 39
length(rate_n)
#> [1] 39
length(pub_dt)
#> [1] 37
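
Rather than reading six numbers by eye, the comparison can be done in one step. A minimal base R sketch with small made-up vectors standing in for the scraped ones:

```r
# Hypothetical stand-ins for the scraped vectors
cols <- list(
  titles = c("Book A", "Book A", "Book B"),
  format = c("Paperback", "Kindle", "Paperback"),
  rating = c(4.7, 4.7, 4.3, NA)   # one element too many
)

# Length of each vector, by name
lens <- lengths(cols)
lens

# TRUE only when every vector has the same length
all_equal <- length(unique(lens)) == 1
all_equal
```

Wrapping the final condition in stopifnot() before building the data frame stops the script early when any vector is off.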

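(Finally) Assemble the Data

tibble() requires every column to have the same length, so rating and rate_n (39 in the check above) must be reconciled with the other vectors (37) before this step will run.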

r_books <- tibble(title            = titles,
                  text_format      = format,
                  price            = price,
                  rating           = rating,
                  num_ratings      = rate_n,
                  publication_date = pub_dt)
head(r_books)
#> # A tibble: 6 x 6
#>   title                    text_format price rating num_ratings publication_date
#>   <chr>                    <chr>       <dbl>  <dbl>       <dbl> <date>          
#> 1 R for Data Science: Imp~ Paperback    40.1    4.7         427 2017-01-10      
#> 2 R for Data Science: Imp~ Kindle       25.0    4.7         427 2017-01-10      
#> 3 The Book of R: A First ~ Paperback    33.0    4.3          76 2016-07-16      
#> 4 The Book of R: A First ~ eTextbook    30.0    4.3          76 2016-07-16      
#> 5 Discovering Statistics ~ Paperback    34.5    4.5         255 2012-04-05      
#> 6 Discovering Statistics ~ Kindle       61.6    4.5         255 2012-04-05

Thank you


@mjhendrickson


matthewjhendrickson


mjhendrickson


Web Scraping in R & rvest repo

This talk is freely distributed under the MIT License.

References & Resources